---
title: Benford's Law Application and Interpretation
date: 2024-02-10
categories:
  - Pandas
  - Accounting
---

Import packages

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

🎩 Benford's Law

Benford's Law, also known as the Newcomb-Benford law or the first-digit law, is a surprising observation about the leading digits of numbers in real-world datasets. In many naturally occurring collections of data, smaller leading digits (like 1 and 2) are significantly more common than larger ones (like 8 and 9). Examples include:

  • Financial records
  • Scientific measurements
  • Astronomical distances
  • Street addresses

Why does this happen?

Real-world data often involves growth, multiplication, and comparisons across different scales. This "scaling invariance" creates a natural bias towards smaller leading digits.
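The scaling argument can be sketched with a toy simulation (illustrative only, not part of the P-Card analysis below): amounts growing at a steady multiplicative rate sweep across many orders of magnitude, and their leading digits settle into Benford-like proportions.

```python
import numpy as np

# Toy example: 5% multiplicative growth over many periods.
# The amounts span several orders of magnitude, so their
# leading digits should approximate Benford's distribution.
amounts = 100 * 1.05 ** np.arange(500)
first_digits = np.array([str(a)[0] for a in amounts])

digits, counts = np.unique(first_digits, return_counts=True)
for d, c in zip(digits, counts):
    print(d, round(c / counts.sum(), 3))
```

Leading digit 1 shows up roughly 30% of the time, close to $\log_{10}(2) \approx 0.301$, even though nothing about the growth process mentions digits at all.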

How can Benford's Law be useful?

Benford's Law can be a quick and powerful tool for detecting anomalies or fraud in data. If a dataset supposedly reflects real-world data but significantly deviates from Benford's Law, it might indicate manipulated or fabricated numbers.


💵 Real-world example using P-Card transactions

Read dataset

In [2]:
df = pd.read_csv('DC_PCard_Transactions.csv')
df.head(3)
Out[2]:
AGENCY TRANSACTION_DATE TRANSACTION_AMOUNT VENDOR_NAME VENDOR_STATE_PROVINCE MCC_DESCRIPTION DCS_LAST_MOD_DTTM OBJECTID
0 Office of Latino Affairs 2009/01/05 05:00:00+00 16.80 USPS 1050050275 QQQ DC Postage Services-Government Only 2009/04/28 20:57:31+00 1
1 Department of Mental Health 2009/01/05 05:00:00+00 229.50 WW GRAINGER 912 DC Industrial Supplies, Not Elsewhere Classified 2009/04/28 20:57:31+00 2
2 District Department of Transportation 2009/01/05 05:00:00+00 3147.33 BRANCH SUPPLY DC Stationery, Office & School Supply Stores 2009/04/28 20:57:31+00 3
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 534641 entries, 0 to 534640
Data columns (total 8 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   AGENCY                 534641 non-null  object 
 1   TRANSACTION_DATE       534641 non-null  object 
 2   TRANSACTION_AMOUNT     534641 non-null  float64
 3   VENDOR_NAME            534602 non-null  object 
 4   VENDOR_STATE_PROVINCE  533012 non-null  object 
 5   MCC_DESCRIPTION        534623 non-null  object 
 6   DCS_LAST_MOD_DTTM      534641 non-null  object 
 7   OBJECTID               534641 non-null  int64  
dtypes: float64(1), int64(1), object(6)
memory usage: 32.6+ MB

Benford's Law Analysis - First Digit

The code below filters out amounts below $1 (removing negatives and amounts with a leading zero), converts the 'TRANSACTION_AMOUNT' column to strings, and extracts the first digit of each amount.

In [4]:
# remove transactions with amounts that are negative or have a leading zero
# retrieve the first digit and use value_counts to find frequency
df_benford = df['TRANSACTION_AMOUNT'] \
    [df['TRANSACTION_AMOUNT'] >= 1] \
    .astype(str).str[0] \
    .value_counts() \
    .to_frame(name="count") \
    .reset_index(names="first_digit") \
    .sort_values('first_digit')

# calculate percentages
df_benford['actual_proportion'] = df_benford['count'] / df_benford['count'].sum()
df_benford
Out[4]:
first_digit count actual_proportion
0 1 150073 0.293817
1 2 99240 0.194295
2 3 62232 0.121840
3 4 51923 0.101656
4 5 42581 0.083366
5 6 30417 0.059551
6 7 27649 0.054132
8 8 23169 0.045361
7 9 23486 0.045982

Benford's proposed distribution of leading digit frequencies is given by

\begin{equation} P_i=\log _{10}\left(\frac{i+1}{i}\right) ; \quad i \in\{1,2,3, \ldots, 9\}, \end{equation}

where $P_i$ is the probability of finding $i$ as the leading digit in a given number.

Create a new column that contains the Benford's proposed distribution of leading digit frequencies.

In [5]:
# append an expected_proportion column that contains Benford's distribution
df_benford['benford_proportion'] = [np.log10(1 + 1 / i) for i in np.arange(1, 10)]
df_benford
Out[5]:
first_digit count actual_proportion benford_proportion
0 1 150073 0.293817 0.301030
1 2 99240 0.194295 0.176091
2 3 62232 0.121840 0.124939
3 4 51923 0.101656 0.096910
4 5 42581 0.083366 0.079181
5 6 30417 0.059551 0.066947
6 7 27649 0.054132 0.057992
8 8 23169 0.045361 0.051153
7 9 23486 0.045982 0.045757

Plot distributions

In [6]:
fig = px.bar(
    data_frame=df_benford,
    x='first_digit',
    y=['actual_proportion', 'benford_proportion'],
    title='<b>Proportions of Leading Digits for P-Card Transactions</b><br>\
<span style="color: #aaa">Compared with Benford\'s Proposed Proportions</span>',
    labels={
        'first_digit': 'First Digit',
    },
    height=500,
    barmode='group',
    template='simple_white'
)

fig.update_layout(
    font_family='Helvetica, Inter, Arial, sans-serif',
    yaxis_title_text='Proportion',
    yaxis_tickformat=',.0%',
    legend_title=None,
)

fig.data[0].name = 'Actual'
fig.data[1].name = 'Benford'

fig.show()

Judging whether a dataset deviates from Benford's Law

There are several ways to judge whether a dataset deviates from Benford's Law, and the choice of method depends on the size and complexity of your data. Here are some common approaches:

  1. Visual Inspection
  2. Chi-Square Test ($\chi^2$)
  3. Mean Absolute Deviation (MAD)
  4. Sum of Squared Differences (SSD)

Visual inspection is the simplest method, but it does not generate numeric measures for comparison. The histogram above does not reveal any significant deviations; however, this approach becomes ambiguous once deviations start to widen.

The other three methods are empirically-based criteria for conformity to Benford's Law.

Although $\chi^2$ is the most common measure of conformity to Benford's Law, research suggests that $\chi^2$ is widely misused and misinterpreted.
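For reference, the $\chi^2$ statistic itself is simple to compute by hand. The sketch below uses synthetic first-digit counts (not the P-Card data) to show the calculation, and also hints at the common pitfall: because the statistic scales with sample size, very large datasets can fail the test on immaterial deviations.

```python
import math

# Synthetic first-digit counts for illustration (not the P-Card data)
observed = [1500, 930, 610, 480, 400, 330, 290, 260, 200]
n = sum(observed)

# Expected counts under Benford's Law: n * log10(1 + 1/d)
expected = [n * math.log10(1 + 1 / d) for d in range(1, 10)]

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))
# With 9 - 1 = 8 degrees of freedom, the 5% critical value is about 15.51;
# a statistic above that would flag a significant deviation.
```

Note that multiplying every observed count by 100 (same proportions, larger sample) would multiply the statistic by 100 as well, which is exactly why $\chi^2$ is easy to misinterpret on large datasets.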

This notebook covers using Mean Absolute Deviation (MAD) and Sum of Squared Differences (SSD). Both MAD and SSD are easy to calculate, provide a single numeric value summarizing the deviation, and are useful for comparing deviations across different datasets.

Mean Absolute Deviation (MAD)

A MAD measure is given by

\begin{equation} \mathrm{MAD}=\frac{\sum_{i=1}^K\left|AP_i-EP_i\right|}{K}, \end{equation}

where

  • $K$ is the number of leading digit bins (9 for first leading digit; 90 for first two leading digits),
  • $i$ is a leading digit (between 1 and 9),
  • $AP_i$ is the actual proportion observed for leading digit $i$,
  • $EP_i$ is the expected proportion for leading digit $i$ according to Benford's Law.
In [7]:
mad = abs(df_benford['actual_proportion'] - df_benford['benford_proportion']).sum() / df_benford.shape[0]
print(f'MAD value = {round(mad, 4)}')
MAD value = 0.0061

Interpreting MAD value

Nigrini's study suggests the following MAD ranges and conclusions for first digits.

| MAD Range | Conformity |
| --- | --- |
| 0.000 to 0.006 | Close conformity |
| 0.006 to 0.012 | Acceptable conformity |
| 0.012 to 0.015 | Marginally acceptable conformity |
| Above 0.015 | Nonconformity |
In [8]:
def MAD_benford_interpretation(mad):
    if mad < 0.006:
        return 'Close Conformity'
    elif mad < 0.012:
        return 'Acceptable Conformity'
    elif mad < 0.015:
        return 'Marginal Conformity'
    else:
        return 'Non-conforming'
In [9]:
print(f'MAD value of {round(mad, 4)} can be interpreted as {MAD_benford_interpretation(mad)}.')
MAD value of 0.0061 can be interpreted as Acceptable Conformity.

Sum of Squared Differences (SSD)

A SSD measure is given by

\begin{equation} \mathrm{SSD}=\sum_{i=1}^K\left(AP_i-EP_i\right)^2 \times 10^4, \end{equation}

where

  • $K$ is the number of leading digit bins (9 for first leading digit; 90 for first two leading digits),
  • $i$ is a leading digit (between 1 and 9),
  • $AP_i$ is the actual proportion observed for leading digit $i$,
  • $EP_i$ is the expected proportion for leading digit $i$ according to Benford's Law.
In [10]:
ssd = sum(((df_benford['actual_proportion'] - df_benford['benford_proportion']) ** 2) * (10 ** 4))
print(f'SSD value = {round(ssd, 4)}')
SSD value = 5.3623

Interpreting SSD value

Kossovsky's study suggests the following SSD ranges and conclusions for first digits.

| SSD Range | Conformity |
| --- | --- |
| 0 to 2 | Perfect conformity |
| 2 to 25 | Acceptable conformity |
| 25 to 100 | Marginal conformity |
| Above 100 | Nonconformity |
In [11]:
def SSD_benford_interpretation(ssd):
    if ssd < 2:
        return 'Perfect Conformity'
    elif ssd < 25:
        return 'Acceptable Conformity'
    elif ssd < 100:
        return 'Marginal Conformity'
    else:
        return 'Non-conforming'
In [12]:
print(f'SSD value of {round(ssd, 1)} can be interpreted as {SSD_benford_interpretation(ssd)}.')
SSD value of 5.4 can be interpreted as Acceptable Conformity.

Both MAD and SSD measures indicate "Acceptable conformity".


Benford's Law Analysis - First Two Digits

The code below extracts the first two digits of the 'TRANSACTION_AMOUNT' column after converting the column into a string type. Amounts below $10 are excluded so that every remaining value has at least two digits before the decimal point.

In [13]:
# only keep transactions with an amount greater than or equal to $10
# retrieve the first two digits and use value_counts to find frequency
# use reset_index() for a clean index from 0 to 89 (optional)
df_benford2 = df['TRANSACTION_AMOUNT'] \
    [df['TRANSACTION_AMOUNT'] >= 10] \
    .astype(str).str[:2] \
    .value_counts() \
    .to_frame(name="count") \
    .reset_index(names="first_two_digits") \
    .sort_values('first_two_digits') \
    .reset_index(drop=True)

# calculate percentages
df_benford2['actual_proportion'] = df_benford2['count'] / df_benford2['count'].sum()
df_benford2
Out[13]:
first_two_digits count actual_proportion
0 10 23603 0.047374
1 11 16148 0.032411
2 12 16886 0.033892
3 13 13807 0.027712
4 14 13892 0.027883
... ... ... ...
85 95 2853 0.005726
86 96 1689 0.003390
87 97 1784 0.003581
88 98 1570 0.003151
89 99 4190 0.008410

90 rows × 3 columns

In [14]:
# append an expected_proportion column that contains Benford's distribution
df_benford2['benford_proportion'] = [np.log10(1 + 1 / i) for i in np.arange(10, 100)]
df_benford2
Out[14]:
first_two_digits count actual_proportion benford_proportion
0 10 23603 0.047374 0.041393
1 11 16148 0.032411 0.037789
2 12 16886 0.033892 0.034762
3 13 13807 0.027712 0.032185
4 14 13892 0.027883 0.029963
... ... ... ... ...
85 95 2853 0.005726 0.004548
86 96 1689 0.003390 0.004501
87 97 1784 0.003581 0.004454
88 98 1570 0.003151 0.004409
89 99 4190 0.008410 0.004365

90 rows × 4 columns

Plot distributions

In [15]:
fig = px.bar(
    data_frame=df_benford2,
    x='first_two_digits',
    y=['actual_proportion', 'benford_proportion'],
    title='<b>Proportions of Leading Two Digits for P-Card Transactions</b><br>\
<span style="color: #aaa">Compared with Benford\'s Proposed Proportions</span>',
    labels={
        'first_two_digits': 'First Two Digits',
    },
    height=500,
    barmode='group',
    template='simple_white',
)

fig.update_layout(
    font_family='Helvetica, Inter, Arial, sans-serif',
    yaxis_title_text='Proportion',
    yaxis_tickformat=',.0%',
    legend_title=None,
    legend=dict(
        yanchor="top",
        y=0.9,
        xanchor="left",
        x=0.75
    ),
)

fig.data[0].name = 'Actual'
fig.data[1].name = 'Benford'

fig.show()

Now we see noteworthy deviations.

Judging deviation of the first two digits using MAD

In [16]:
mad2 = abs(df_benford2['actual_proportion'] - df_benford2['benford_proportion']).sum() / df_benford2.shape[0]
print(f'MAD value = {round(mad2, 5)}')
MAD value = 0.00208

Interpreting MAD value

Nigrini's study suggests the following MAD ranges and conclusions for the first two digits. Note that the ranges differ from the previous table where only the first digit was used for analysis.

| MAD Range | Conformity |
| --- | --- |
| 0.0000 to 0.0012 | Close conformity |
| 0.0012 to 0.0018 | Acceptable conformity |
| 0.0018 to 0.0022 | Marginally acceptable conformity |
| Above 0.0022 | Nonconformity |
In [17]:
def MAD2_benford_interpretation(mad):
    if mad < 0.0012:
        return 'Close Conformity'
    elif mad < 0.0018:
        return 'Acceptable Conformity'
    elif mad < 0.0022:
        return 'Marginal Conformity'
    else:
        return 'Non-conforming'
In [18]:
print(f'MAD value of {round(mad2, 5)} can be interpreted as {MAD2_benford_interpretation(mad2)}.')
MAD value of 0.00208 can be interpreted as Marginal Conformity.

Although the histogram shows notable deviations, the MAD measure is within the "marginal conformity" range.


Closing thoughts

  • Benford's Law only applies to datasets with certain characteristics, like scale invariance and growth processes.
  • It's a statistical observation, not a rule, and deviations can occur.
  • It's a powerful tool for detecting anomalies, but not a foolproof method.

Citations